ABSTRACT

Linear Discriminant Analysis (LDA) is a traditional statistical method which has proven successful on
classification and dimensionality reduction problems (5). The procedure is based on an eigenvalue decomposition and gives an
exact solution for the maximum of the inertia, but the method fails on nonlinear problems.

To solve this problem, kernel Fisher Discriminant Analysis (KFDA) is used: it carries out Fisher linear discriminant
analysis in a high dimensional feature space defined implicitly by a kernel. The performance of KFDA depends on the
choice of the kernel.

In this paper, we consider the problem of finding the optimal solution for a given linear kernel function, in both the
primal and the dual variables of the Fisher Discriminant. We take a small sample of 20 cases of HIV disease with
three factors (Age, Gender, number of Lymphocyte cells) and two levels, and examine how these observations are
classified by testing the classification with the Rayleigh Coefficient statistic.

KEYWORDS: Linear Fisher Discriminant, Kernel Fisher Discriminant, Rayleigh Coefficient, Cross-Validation, 
Regularization 

INTRODUCTION 

Fisher’s linear Discriminant separates classes of data by selecting the features that maximize the ratio of projected
class means to projected intraclass variances (3).

The intuition behind Fisher’s linear Discriminant (FLD) consists of looking for a vector of components $w$ such
that, when a set of training samples is projected onto it, the class centers are far apart while the spread within each class is
small, consequently producing a small overlap between classes (11). This is done by maximizing a cost function known in
some contexts as the Rayleigh Coefficient, $J(w)$.

Kernel Fisher’s Discriminant (KFD) is a nonlinearization that follows the same principle as Fisher's Linear
Discriminant but in a typically high-dimensional feature space. In this case, the algorithm is reformulated in terms of
$J(\alpha)$, where $\alpha$ is the new direction of discrimination. The theory of reproducing kernels in Hilbert space (1) gives the
relation between the vectors $w$ and $\alpha$. In either case, the objective is to determine the most “plausible” direction according
to the statistic $J$. It has been demonstrated (5) that KFD can be applied to classification problems with competitive results. KFD shares
many of the virtues of other kernel based algorithms: the appealing interpretation of a kernel as a mapping of an input to a



Impact Factor(JCC): 1.8207- This article can be downloaded from www.impactiournals.us 








Chro K. Salih 



high dimensional space and good performance in real life applications, among the most important. However, it also suffers
from the deficiencies of kernelized algorithms: the solution will typically include a regularization coefficient to limit model
complexity, and parameter estimation will rely on some form of cross-validation, which precludes the use of richer models.

Recently, KFDA has received a lot of interest in the literature (14, 9). A main advantage of KFDA over other
kernel-based methods is that it is computationally simple: it requires only the factorization of the Gram matrix computed
from the given training examples, unlike other methods which solve dense (convex) optimization problems.

THEORETICAL PART 

Notation and Definitions 

We use $\mathcal{X}$ to denote the input or instance set, which is an arbitrary subset of $\mathbb{R}^n$, and $\mathcal{Y} = \{-1, +1\}$ to denote
the output or class label set. An input-output pair $(x, y)$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, is called an example. An example
is called positive (or negative) if its class label is $+1$ ($-1$). We assume that the examples are drawn randomly and
independently from a fixed, but unknown, probability distribution over $\mathcal{X} \times \mathcal{Y}$.

A symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel (function) if it satisfies the finitely positive
semi-definite property: for any $x_1, x_2, \ldots, x_m \in \mathcal{X}$, the Gram matrix $G \in \mathbb{R}^{m \times m}$, defined by

$$ G_{ij} = K(x_i, x_j) \qquad (1) $$

is positive semi-definite. Mercer’s theorem (12) tells us that any kernel function $K$ implicitly maps the input set
$\mathcal{X}$ to a high dimensional (possibly infinite) Hilbert space $\mathcal{H}$ equipped with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ through a mapping
$\phi : \mathcal{X} \to \mathcal{H}$:

$$ K(x, z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}, \qquad \forall x, z \in \mathcal{X} $$

We often write the inner product $\langle \phi(x), \phi(z) \rangle_{\mathcal{H}}$ as $\phi(x)^T \phi(z)$ when the Hilbert space is clear from the
context. This space is called the feature space, and the mapping is called the feature mapping. They depend on the kernel
function $K$ and will be denoted as $\phi_K$ and $\mathcal{H}_K$. The Gram matrix $G \in \mathbb{R}^{m \times m}$ defined in (1) will be denoted $G_K$
when it is necessary to indicate the dependence on $K$ (7).
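The definition in (1) can be checked numerically. The following is a minimal sketch, assuming NumPy and a linear kernel; the helper names `linear_kernel`, `gram_matrix` and `is_positive_semidefinite`, the tolerance and the random demo data are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def linear_kernel(x, z):
    """Linear kernel K(x, z) = x'z (illustrative choice of kernel)."""
    return float(np.dot(x, z))


def gram_matrix(X, kernel):
    """Return G with G[i, j] = kernel(X[i], X[j]), as in equation (1)."""
    m = X.shape[0]
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = kernel(X[i], X[j])
    return G


def is_positive_semidefinite(G, tol=1e-10):
    """Check symmetry and that no eigenvalue lies below a small negative tolerance."""
    if not np.allclose(G, G.T):
        return False
    eigenvalues = np.linalg.eigvalsh(G)
    scale = max(1.0, float(np.abs(eigenvalues).max()))
    return bool(eigenvalues.min() >= -tol * scale)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))  # hypothetical data: 5 examples in R^3
G_demo = gram_matrix(X_demo, linear_kernel)
print(is_positive_semidefinite(G_demo))
```

For the linear kernel the Gram matrix is simply $XX'$, which is why the check is expected to succeed; a symmetric matrix with a negative eigenvalue would be rejected.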

FISHER DISCRIMINANT 



Fisher Discriminant is one of the earliest approaches to the problem of classification learning. The idea underlying this
approach is slightly different from the ideas outlined so far: rather than using the decomposition $P_{XY} = P_{Y|X} P_X$, we now
decompose the unknown probability measure constituting the learning problem as $P_{XY} = P_{X|Y} P_Y$. The essential difference
between these two formal expressions becomes apparent when considering the model choices:

• In the case of $P_{XY} = P_{Y|X} P_X$ we use hypotheses $h \in \mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ to model the conditional measure $P_{Y|X}$ of
classes $y \in \mathcal{Y}$ given objects $x \in \mathcal{X}$ and marginalize over $P_X$. In the noise-free case, each hypothesis defines
such a model by $P_{Y|X=x,H=h}(y) = I_{h(x)=y}$. Since our model for learning contains only predictors $h : \mathcal{X} \to \mathcal{Y}$
that discriminate between objects, this approach is sometimes called the predictive or discriminative approach.

• In the case of $P_{XY} = P_{X|Y} P_Y$ we model the generation of objects $x \in \mathcal{X}$ given the class $y \in \mathcal{Y} = \{-1, +1\}$ by
some assumed probability model $P_{X|Y=y,Q=\theta}$, where $\theta = (\theta_{+1}, \theta_{-1}, p) \in Q$ parameterizes this generation process.
We have the additional parameter $p \in [0, 1]$ to describe the probability $P_{Y|Q=\theta}(y)$ by
$p \cdot I_{y=+1} + (1 - p) \cdot I_{y=-1}$. As the model $Q$ contains probability measures from which the generated training
sample $x \in \mathcal{X}$ is sampled, this approach is sometimes called the generative or sampling approach.

In order to classify a new test object $x \in \mathcal{X}$ with a model $\theta$ in the generative approach we make use of Bayes'
theorem, i.e.

$$ P_{Y|X=x,Q=\theta}(y) = \frac{P_{X|Y=y,Q=\theta}(x)\, P_{Y|Q=\theta}(y)}{\sum_{\tilde{y} \in \mathcal{Y}} P_{X|Y=\tilde{y},Q=\theta}(x)\, P_{Y|Q=\theta}(\tilde{y})} $$

In the case of two classes and the zero-one loss $l_{0-1}(h(x), y) \overset{\text{def}}{=} I_{h(x) \neq y}$ we obtain for the Bayes optimal
classification at a novel test object $x \in \mathcal{X}$,

$$ h_\theta(x) = \operatorname*{argmax}_{y \in \{-1,+1\}} P_{Y|X=x,Q=\theta}(y) = \operatorname{sign}\left( \ln \frac{P_{X|Y=+1,Q=\theta}(x) \cdot p}{P_{X|Y=-1,Q=\theta}(x) \cdot (1 - p)} \right) \qquad (2) $$

as the fraction in this expression is greater than one if, and only if, $P_{XY|Q=\theta}((x, +1))$ is greater than
$P_{XY|Q=\theta}((x, -1))$. In the generative approach the task of learning amounts to finding the parameters $\theta^* \in Q$ or measures
$P_{X|Y=y,Q=\theta^*}$ and $P_{Y|Q=\theta^*}$ which incur the smallest expected risk $R[h_{\theta^*}]$ by virtue of equation (2). Again, we are faced
with the problem that, without restrictions on the measure $P_{X|Y=y}$, the best model is the empirical measure $v_{x_y}(x)$,
where $x_y \subseteq x$ is the sample of all training objects of class $y$. Obviously, this is a bad model because $v_{x_y}(x)$ assigns
zero probability to all test objects not present in the training sample and thus $h_\theta(x) = 0$, i.e. we are unable to make
predictions on unseen objects. Similarly to the choice of the hypothesis space in the discriminative model, we must
constrain the possible generative models $P_{X|Y=y}$.

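Let us consider the class of probability measures from the exponential family

$$ f_{X|Y=y,Q=\theta}(x) = a_0(\theta_y)\, \tau_0(x) \exp\left(\theta_y' \tau(x)\right) $$

for some fixed functions $a_0 : Q_y \to \mathbb{R}$, $\tau_0 : \mathcal{X} \to \mathbb{R}$ and $\tau : \mathcal{X} \to \mathcal{K}$. Using this functional form of the
density we see that each decision function $h_\theta$ must be of the following form

$$ h_\theta(x) = \operatorname{sign}\left( \ln \frac{a_0(\theta_{+1})\, \tau_0(x) \exp\left(\theta_{+1}' \tau(x)\right) \cdot p}{a_0(\theta_{-1})\, \tau_0(x) \exp\left(\theta_{-1}' \tau(x)\right) \cdot (1 - p)} \right) $$

$$ = \operatorname{sign}\left( (\theta_{+1} - \theta_{-1})' \tau(x) + \ln \frac{a_0(\theta_{+1}) \cdot p}{a_0(\theta_{-1}) \cdot (1 - p)} \right) $$

$$ = \operatorname{sign}\left( \langle w, \tau(x) \rangle + b \right) \qquad (3) $$

where $w = \theta_{+1} - \theta_{-1}$ and $b = \ln \frac{a_0(\theta_{+1}) \cdot p}{a_0(\theta_{-1}) \cdot (1 - p)}$.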



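This result is very interesting as it shows that, for a rather large class of generative models, the final classification
function is a linear function in the model parameters $\theta = (\theta_{-1}, \theta_{+1}, p)$. Now consider the special case that the
distribution $P_{X|Y=y,Q=\theta}$ of objects $x \in \mathcal{X}$ given classes $y \in \{-1, +1\}$ is a multidimensional Gaussian in some feature
space $\mathcal{K} \subseteq \ell_2^n$ mapped into by some given feature map $\phi : \mathcal{X} \to \mathcal{K}$,

$$ f_{X|Y=y,Q=\theta}(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_y)' \Sigma^{-1} (x - \mu_y) \right) \qquad (4) $$

where the parameters $\theta_y$ are the mean vector $\mu_y \in \mathbb{R}^n$ and the covariance matrix $\Sigma_y \in \mathbb{R}^{n \times n}$, respectively.
Making the additional assumptions that the covariance matrix $\Sigma$ is the same for both models $\theta_{+1}$, $\theta_{-1}$ and that
$P_{Y|Q=\theta}(+1) = P_{Y|Q=\theta}(-1)$, we see that

$$ \theta = \left( \Sigma^{-1}\mu;\ -\frac{\Sigma^{-1}_{11}}{2};\ -\Sigma^{-1}_{12};\ \ldots;\ -\frac{\Sigma^{-1}_{22}}{2};\ -\Sigma^{-1}_{23};\ \ldots;\ -\frac{\Sigma^{-1}_{nn}}{2} \right), \qquad \tau(x) = \left( x;\ x_1^2;\ x_1 x_2;\ \ldots;\ x_1 x_n;\ x_2^2;\ x_2 x_3;\ \ldots;\ x_n^2 \right) \qquad (5) $$

$$ \tau_0(x) = 1, \qquad a_0(\theta) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} \mu' \Sigma^{-1} \mu \right) \qquad (6) $$

According to equations (3), (5) and (6) we then have

$$ \tau(x) = x, \qquad w = \Sigma^{-1}(\mu_{+1} - \mu_{-1}), \qquad b = \frac{1}{2}\left( \mu_{-1}' \Sigma^{-1} \mu_{-1} - \mu_{+1}' \Sigma^{-1} \mu_{+1} \right) \qquad (7) $$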

This result also follows from substituting (4) directly into equation (2) (see Figure 1, left).

Figure 1: Fisher Discriminant. (Left) The black line represents the decision boundary. This must always be a linear
function because both models use the same (estimated) covariance matrix $\hat{\Sigma}$ (ellipses).

An appealing feature of this classifier is that it has a clear geometrical interpretation which was proposed for the
first time by R. A. Fisher. Instead of working with $n$-dimensional vectors $x$ we consider only their projection onto a
hyperplane with normal $w \in \mathcal{K}$. Let $\mu_y(w) = E_{X|Y=y}[w'\phi(X)]$ be the expectation of the projections of mapped
objects $x$ from class $y$ onto the linear discriminant having normal $w$ and
$\sigma_y^2(w) = E_{X|Y=y}[(w'\phi(X) - \mu_y(w))^2]$ the variance of these projections. Then choose as the direction $w \in \mathcal{K}$ of
the linear discriminant a direction along which the maximum of the relative distance between the $\mu_y(w)$ is obtained,
that is, the direction $w_{FD}$ along which the maximum of

$$ J(w) = \frac{\left( \mu_{+1}(w) - \mu_{-1}(w) \right)^2}{\sigma_{+1}^2(w) + \sigma_{-1}^2(w)} \qquad (8) $$

is attained. Intuitively, the numerator measures the inter-class distance of points from the two classes $\{+1, -1\}$
whereas the denominator measures the intra-class distance of points in each of the two classes (see also Figure 1, right, which
gives a geometrical interpretation of the Fisher Discriminant objective function (8)). Given a weight vector $w \in \mathcal{K}$, each mapped
training object $x$ is projected onto $w$ by virtue of $t = \langle x, w \rangle$. The objective function measures the ratio of the inter-class
distance $(\mu_{+1}(w) - \mu_{-1}(w))^2$ and the intra-class distance $\sigma_{+1}^2(w) + \sigma_{-1}^2(w)$. Thus the function $J$ is maximized if the
inter-class distance is large and the intra-class distance is small. In general, the Fisher linear discriminant $w_{FD}$ suffers from the
problem that its determination is a very difficult mathematical and algorithmical problem. However, in the particular case
of $P_{X|Y=y,Q=\theta} = \mathrm{Normal}(\mu_y, \Sigma)$ (see footnote 2), a closed form solution to this problem is obtained by noticing that $T = w'\phi(X)$
is also normally distributed with $P_{T|Y=y,Q=\theta} = \mathrm{Normal}(w'\mu_y, w'\Sigma w)$. Thus, the objective function given in equation
(8) can be written as

$$ J(w) = \frac{\left( w'(\mu_{+1} - \mu_{-1}) \right)^2}{w'\Sigma w + w'\Sigma w} = \frac{1}{2}\, \frac{w'(\mu_{+1} - \mu_{-1})(\mu_{+1} - \mu_{-1})' w}{w'\Sigma w} \qquad (9) $$

which is known as the generalized Rayleigh quotient having the maximizer $w_{FD}$,

$$ w_{FD} = \Sigma^{-1}(\mu_{+1} - \mu_{-1}) \qquad (10) $$

This expression equals the weight vector $w$ found by considering the optimal classification under the assumption
of a multidimensional Gaussian measure for the class conditional distributions $P_{X|Y=y}$.
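The claim that $\Sigma^{-1}(\mu_{+1} - \mu_{-1})$ maximizes the generalized Rayleigh quotient can be illustrated numerically. The sketch below assumes NumPy; the function name `rayleigh_quotient`, the two mean vectors, the covariance matrix and the random search are hypothetical illustrations and are not the data of this paper.

```python
import numpy as np


def rayleigh_quotient(w, mu_pos, mu_neg, sigma):
    """J(w) = 0.5 * (w'(mu_pos - mu_neg))^2 / (w' Sigma w), as in equation (9)."""
    diff = mu_pos - mu_neg
    return 0.5 * float(w @ diff) ** 2 / float(w @ sigma @ w)


# Hypothetical class means and shared covariance matrix (positive definite).
mu_pos = np.array([2.0, 1.0])
mu_neg = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Closed-form maximizer from equation (10).
w_fd = np.linalg.solve(sigma, mu_pos - mu_neg)
j_fd = rayleigh_quotient(w_fd, mu_pos, mu_neg, sigma)

# No randomly drawn direction should achieve a larger value of J.
rng = np.random.default_rng(1)
j_random = [rayleigh_quotient(rng.normal(size=2), mu_pos, mu_neg, sigma)
            for _ in range(1000)]
print(j_fd, max(j_random))
```

Note that $J$ is invariant to rescaling of $w$, so only the direction of the maximizer matters, and its maximum value equals $\frac{1}{2}(\mu_{+1} - \mu_{-1})'\Sigma^{-1}(\mu_{+1} - \mu_{-1})$.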



Unfortunately, as with the discriminative approach, we do not know the
parameters $\theta = (\mu_{+1}, \mu_{-1}, \Sigma) \in Q$ but have to "learn" them from the given training sample
$z = (x, y) \in \mathcal{Z}^m$. We shall employ the Bayesian idea of expressing our prior belief in certain parameters via some prior
measure $P_Q$. After having seen the training sample $z$ we update our prior belief $P_Q$, giving a posterior belief $P_{Q|Z^m=z}$.
Since we need one particular parameter value we compute the MAP estimate $\hat{\theta}$, that is, we choose the value of $\theta$ which
attains the maximum a-posteriori belief $P_{Q|Z^m=z}$ (see footnote 3). If we choose an (improper) uniform prior $P_Q$ then the parameter
$\hat{\theta}$ equals the parameter vector which maximizes the likelihood and is therefore also known as the maximum likelihood
estimator. These estimates are given by

$$ \hat{\mu}_y = \frac{1}{m_y} \sum_{(x_i, y) \in z} x_i, \qquad \hat{\Sigma} = \frac{1}{m} \sum_{y \in \{+1,-1\}} \sum_{(x_i, y) \in z} (x_i - \hat{\mu}_y)(x_i - \hat{\mu}_y)' = \frac{1}{m} \left( X'X - \sum_{y \in \{+1,-1\}} m_y \hat{\mu}_y \hat{\mu}_y' \right) \qquad (11) $$

where $X \in \mathbb{R}^{m \times n}$ is the data matrix obtained by applying $\phi : \mathcal{X} \to \mathcal{K}$ to each training object $x \in x$ and $m_y$ equals the
number of training examples of class $y$. Substituting the estimates into the equations (7) results in the so-called Fisher
Linear Discriminant $w_{FD}$. The pseudocode of this algorithm is given in Appendix (A) (6).

2 Note that $\mu_y(w) \in \mathbb{R}$ is a real number whereas $\mu_y \in \mathcal{K}$ is an $n$-dimensional vector in feature space.

3 For details see Learning Kernel Classifiers p.80.
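The estimation and substitution steps described above can be sketched in a few lines. This is a minimal illustration assuming NumPy and an identity feature map; the function names, the optional `ridge` argument and the synthetic two-class data are assumptions introduced for the example, not the HIV sample analysed in this paper.

```python
import numpy as np


def fisher_discriminant_primal(X, y, ridge=0.0):
    """Estimate (mu_y, Sigma) as in equation (11) and return (w, b) from equation (7).

    `ridge` is a hypothetical extra argument: a multiple of the identity added to
    the covariance estimate (0.0 reproduces the plain estimate).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m, n = X.shape
    mu = {}
    scatter = X.T @ X
    for label in (+1, -1):
        X_label = X[y == label]
        mu[label] = X_label.mean(axis=0)
        scatter = scatter - X_label.shape[0] * np.outer(mu[label], mu[label])
    sigma = scatter / m + ridge * np.eye(n)
    w = np.linalg.solve(sigma, mu[+1] - mu[-1])
    b = 0.5 * (mu[-1] @ np.linalg.solve(sigma, mu[-1])
               - mu[+1] @ np.linalg.solve(sigma, mu[+1]))
    return w, float(b)


def classify(X, w, b):
    """Decision rule h(x) = sign(<w, x> + b)."""
    return np.where(np.asarray(X) @ w + b >= 0.0, 1, -1)


# Synthetic, well-separated two-class data (illustration only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, scale=1.0, size=(30, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(30, 2))
X_train = np.vstack([X_pos, X_neg])
y_train = np.array([1] * 30 + [-1] * 30)

w_fd, b_fd = fisher_discriminant_primal(X_train, y_train)
accuracy = float(np.mean(classify(X_train, w_fd, b_fd) == y_train))
print(w_fd, b_fd, accuracy)
```

With the offset of equation (7) the decision threshold lies midway between the two projected class means, so the estimated mean of the positive class is always classified as $+1$.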

KERNEL FISHER DISCRIMINANT 

In an attempt to "kernelize" the algorithm of the Fisher Linear Discriminant, note that a crucial requirement is that
$\hat{\Sigma} \in \mathbb{R}^{n \times n}$ has full rank, which is impossible if $\dim(\mathcal{K}) = n \gg m$. Since the idea of using kernels reduces
computational complexity in these cases, we see that it is impossible to apply the kernel trick directly to this algorithm.
Therefore, let us proceed along the following route: given the data matrix $X \in \mathbb{R}^{m \times n}$ we project the $m$ data vectors
$x_i \in \mathbb{R}^n$ into the $m$-dimensional space spanned by the mapped training objects using $x \mapsto Xx$ and then estimate the mean
vector and the covariance matrix in $\mathbb{R}^m$ using equation (11). The problem with this approach is that $\hat{\Sigma}$ is at most of rank
$m - 2$ because it is an outer product matrix of two centered vectors. In order to remedy this situation we apply the
technique of regularization to the resulting $m \times m$ covariance matrix, i.e. we penalize the diagonal of this matrix by
adding $\lambda I$ to it, where a large value of $\lambda$ corresponds to increased penalization. As a consequence, the projected
$m$-dimensional mean vector $k_y \in \mathbb{R}^m$ and covariance matrix $S \in \mathbb{R}^{m \times m}$ are given by

$$ k_y = \frac{1}{m_y} \sum_{(x_i, y) \in z} X x_i = \frac{1}{m_y}\, G \left( I_{y_1 = y}, \ldots, I_{y_m = y} \right)' $$

$$ S = \frac{1}{m} \left( XX'XX' - \sum_{y \in \{+1,-1\}} m_y k_y k_y' \right) + \lambda I = \frac{1}{m} \left( GG - \sum_{y \in \{+1,-1\}} m_y k_y k_y' \right) + \lambda I $$

where the $m \times m$ matrix $G$ with $G_{ij} = \langle x_i, x_j \rangle = k(x_i, x_j)$ is the Gram matrix. Using $k_y$ and $S$ in
place of $\mu_y$ and $\Sigma$ in the equations (7) results in the so-called kernel Fisher Discriminant. We note that the $m$-dimensional
vector computed corresponds to the linear expansion coefficients $\hat{\alpha} \in \mathbb{R}^m$ of a weight vector $w_{KFD}$ in feature space
because the classification of a novel test object $x \in \mathcal{X}$ by the Kernel Fisher Discriminant is carried out on the projected
data point $Xx$, i.e.
$$ h(x) = \operatorname{sign}\left( \langle \hat{\alpha}, Xx \rangle + \hat{b} \right) = \operatorname{sign}\left( \sum_{i=1}^{m} \hat{\alpha}_i k(x_i, x) + \hat{b} \right) $$

$$ \hat{\alpha} = S^{-1}(k_{+1} - k_{-1}), \qquad \hat{b} = \frac{1}{2}\left( k_{-1}' S^{-1} k_{-1} - k_{+1}' S^{-1} k_{+1} \right) \qquad (12) $$

It is worth mentioning that we would have obtained the same solution by exploiting the fact that the objective
function (8) depends only on inner products between mapped training objects $x_i$ and the unknown weight vector $w$. By
virtue of the Representer Theorem (see footnote 4) the solution can be written as $w_{FD} = \sum_{i=1}^{m} \hat{\alpha}_i x_i$ which, inserted into (8), yields a
function in $\alpha$ whose maximizer is given by equation (12). The pseudocode of this algorithm is given in Appendix (B) (6, 10).
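The dual computation in equation (12) needs only the Gram matrix and the labels. The following sketch assumes NumPy and a linear kernel; the function names, the synthetic data and the reuse of the value 0.57373 as the regularization parameter are illustrative assumptions and do not reproduce the paper's computation.

```python
import numpy as np


def kernel_fisher_discriminant(G, y, lam):
    """Return (alpha, b) of equation (12) from the Gram matrix G and labels y."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y)
    m = G.shape[0]
    k = {}
    S = G @ G
    for label in (+1, -1):
        indicator = (y == label).astype(float)
        m_label = indicator.sum()
        k[label] = G @ indicator / m_label      # projected class mean k_y
        S = S - m_label * np.outer(k[label], k[label])
    S = S / m + lam * np.eye(m)                  # regularized m x m covariance
    alpha = np.linalg.solve(S, k[+1] - k[-1])
    b = 0.5 * (k[-1] @ np.linalg.solve(S, k[-1])
               - k[+1] @ np.linalg.solve(S, k[+1]))
    return alpha, float(b)


def kfd_classify(K_new, alpha, b):
    """K_new[j, i] = k(x_i, x_new_j); returns sign(sum_i alpha_i k(x_i, x) + b)."""
    return np.where(np.asarray(K_new) @ alpha + b >= 0.0, 1, -1)


# Synthetic, well-separated two-class data (illustration only).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(loc=2.0, size=(30, 2)),
                     rng.normal(loc=-2.0, size=(30, 2))])
y_train = np.array([1] * 30 + [-1] * 30)

G_train = X_train @ X_train.T                    # linear kernel Gram matrix
alpha, b = kernel_fisher_discriminant(G_train, y_train, lam=0.57373)
accuracy = float(np.mean(kfd_classify(G_train, alpha, b) == y_train))
print(accuracy)
```

Replacing `X_train @ X_train.T` by any other positive semi-definite Gram matrix changes the feature space without changing the rest of the computation, which is the practical appeal of the dual form.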

RAYLEIGH COEFFICIENT 

To find the optimal linear discriminant we need to maximize a Rayleigh coefficient (cf. Equation (9)). Fisher's
Discriminant can also be interpreted as a feature extraction technique that is defined by the separability criterion (8). From this
point of view, we can think of the Rayleigh coefficient as a general tool to find features which (i) cover much of what is
considered to be interesting (e.g. variance in PCA), and (ii) at the same time avoid what is considered disturbing (e.g.
within-class variance in Fisher's Discriminant). The ratio in (9) is maximized when one covers as much as possible of the
desired information while avoiding the undesired. For Fisher's Discriminant this problem can be
solved via a generalized eigenproblem. By using the same technique, one can also compute second, third, etc., generalized
eigenvectors from the generalized eigenproblem, for example in PCA where we are usually looking for more than just one
feature (10).
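The generalized eigenproblem view can be sketched as follows, assuming NumPy. Here $S_B$ denotes the rank-one between-class matrix $(\mu_{+1} - \mu_{-1})(\mu_{+1} - \mu_{-1})'$ and $S_W$ a positive definite within-class matrix; these symbols, the function name and the numbers are hypothetical and introduced only for this example.

```python
import numpy as np


def generalized_eigenpairs(S_B, S_W):
    """Solve S_B v = lambda S_W v via the ordinary eigenproblem of S_W^{-1} S_B.

    Returns eigenvalues in decreasing order and the matching eigenvectors as columns.
    """
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigenvalues.real)[::-1]
    return eigenvalues.real[order], eigenvectors.real[:, order]


# Hypothetical within-class matrix (positive definite) and mean difference.
S_W = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.5]])
delta = np.array([1.0, -2.0, 0.5])
S_B = np.outer(delta, delta)

eigenvalues, eigenvectors = generalized_eigenpairs(S_B, S_W)
leading = eigenvectors[:, 0]

# Closed-form direction for comparison: S_W^{-1} (mu_{+1} - mu_{-1}).
w_closed = np.linalg.solve(S_W, delta)
cosine = abs(float(leading @ w_closed)) / (
    np.linalg.norm(leading) * np.linalg.norm(w_closed))
print(eigenvalues[0], cosine)
```

In the two-class case $S_B$ has rank one, so only the leading generalized eigenvector carries information; further eigenvectors become relevant when the numerator matrix has higher rank, as in PCA.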

REGULARIZATION 

Optimizing the Rayleigh coefficient for Fisher’s Discriminant in a feature space poses some problems. For
example, the matrix $\Sigma$ may not be strictly positive, and numerical problems can cause the matrix $\Sigma$ not even to be positive
semi-definite. Furthermore, we know that for successful learning it is
absolutely mandatory to control the size and complexity of the function class we choose our estimates from. This
issue was not particularly problematic for linear discriminants since they already present a rather simple hypothesis class.
Now, using the kernel trick, we can represent an extremely rich class of possible non-linear solutions, and we can always
achieve a solution with zero within-class variance (i.e. $w'\Sigma w = 0$). Such a solution will, except for pathological cases, be
overfitting.

4 For details see Learning Kernel Classifiers p.48.

To impose a certain regularity, the simplest possible solution is to add a multiple of the identity matrix to $\Sigma$, i.e.
replace $\Sigma$ by $\Sigma_\lambda$ where

$$ \Sigma_\lambda = \Sigma + \lambda I \qquad (\lambda > 0) $$
This Can Be Viewed in Different Ways

• If $\lambda$ is sufficiently large this makes the problem feasible and numerically more stable as $\Sigma_\lambda$ becomes positive
definite.

• Increasing $\lambda$ decreases the variance inherent to the estimate $\Sigma$; for $\lambda \to \infty$ the estimate becomes less and less
sensitive to the covariance structure. In fact, for $\lambda = \infty$ the solution will lie in the direction of $\mu_{+1} - \mu_{-1}$. The
estimate of these "means", however, converges very fast and is very stable (2).

• For a suitable choice of $\lambda$, this can be seen as decreasing the bias in sample based estimation of eigenvalues (4).
The point here is that the empirical eigenvalue of a covariance matrix is not an unbiased estimator of the
corresponding eigenvalue of the true covariance matrix, i.e. as we see more and more examples, the largest
eigenvalue does not converge to the largest eigenvalue of the true covariance matrix. One can show that the
largest eigenvalues are over-estimated and that the smallest eigenvalues are under-estimated.
However, the sum of all eigenvalues (i.e. the trace) does converge since the estimation of the covariance matrix itself
(when done properly) is unbiased.
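The first two effects listed above can be observed on a small synthetic example. The sketch below assumes NumPy; the dimensions, the values of $\lambda$ and the helper names are illustrative assumptions. It uses more dimensions than samples so that the covariance estimate is singular before regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 6                       # more dimensions than samples: singular estimate
X = rng.normal(size=(m, n))
y = np.array([1, 1, 1, -1, -1, -1])

mu_pos = X[y == 1].mean(axis=0)
mu_neg = X[y == -1].mean(axis=0)
centered = np.vstack([X[y == 1] - mu_pos, X[y == -1] - mu_neg])
sigma = centered.T @ centered / m  # within-class covariance estimate
diff = mu_pos - mu_neg


def regularized_direction(sigma, diff, lam):
    """Unit-length solution of (Sigma + lam * I) w = mu_pos - mu_neg."""
    w = np.linalg.solve(sigma + lam * np.eye(sigma.shape[0]), diff)
    return w / np.linalg.norm(w)


def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Effect 1: adding lam * I lifts every eigenvalue by lam.
min_eig_raw = float(np.linalg.eigvalsh(sigma).min())
min_eig_reg = float(np.linalg.eigvalsh(sigma + 0.5 * np.eye(n)).min())

# Effect 2: for very large lam the direction approaches the mean difference.
cos_small = cosine(regularized_direction(sigma, diff, 1e-2), diff)
cos_large = cosine(regularized_direction(sigma, diff, 1e6), diff)
print(min_eig_raw, min_eig_reg, cos_small, cos_large)
```

The smallest eigenvalue of the unregularized estimate is numerically zero, while the regularized matrix is positive definite; the cosine between the solution and the mean difference approaches one as $\lambda$ grows.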

Another possible regularization strategy would be to add a multiple of the kernel matrix $K$ to $S$, i.e. replace $S$ with

$$ S_\lambda = S + \lambda K \qquad (\lambda > 0) $$



The regularization value is computed using cross-validation; the pseudocode of this algorithm is given in
Appendix (C)
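The generalized cross-validation score referred to here can be sketched as below, following the form given in Appendix (C): a smoother matrix $S = X(X'X)^{-1}X'$, fitted values $Sy$, and squared residuals inflated by $1 - \mathrm{trace}(S)/N$. The sketch assumes NumPy; the function name and the synthetic regression data are assumptions made for illustration and are unrelated to the paper's sample.

```python
import numpy as np


def generalized_cross_validation(X, y):
    """Return (GCV score, smoother matrix S) for the linear smoother S = X (X'X)^{-1} X'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    S = X @ np.linalg.solve(X.T @ X, X.T)   # smoother (hat) matrix
    y_hat = S @ y                           # fitted values
    shrink = 1.0 - np.trace(S) / N
    gcv = float(np.mean(((y - y_hat) / shrink) ** 2))
    return gcv, S


# Synthetic regression data with N = 20 observations and 3 columns (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=20)

gcv_value, S_hat = generalized_cross_validation(X_demo, y_demo)
print(gcv_value, np.trace(S_hat))
```

For this smoother the trace of $S$ equals the number of columns of $X$, so the score is the mean squared residual divided by $(1 - p/N)^2$ with $p = 3$ and $N = 20$.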

PRACTICAL PART 

Introduction 

In this paper, by taking a small sample of 20 observations, we want to make clear how these observations are classified by
testing the classification with the Rayleigh Coefficient statistic. The concept is to maximize the distance between group means
while minimizing the distance within groups, so as to obtain the optimal solution with one of the non-linear Fisher Discriminants,
namely the Kernel Fisher Discriminant, in the primal and dual variables with two levels ($\pm 1$).



From Appendix (A), the Fisher Discriminant in the primal variable, taking $\lambda = \sigma^2 = 0.57373$ as computed by
generalized (leave-one-out) cross-validation (see Appendix (C)), gives the vector of coefficients

w = (
-0.000365865255644
0.701870728355830
0.000002878439743
0.127346290424504
0.000174843224781
-0.006478024238546
)'

$$ J(w) = \frac{15.306248465897635}{1.956160043675995} = 7.824640174703855 $$

b = 2.513798662130466

The value of the Rayleigh coefficient in the primal variable makes clear that the distance between groups is greater
than the distance within groups. This means that the between-class scatter is maximized and the within-class
scatter is minimized, as required, and the solution is feasible.

From Appendix (B), the Fisher Discriminant in the dual variable, taking $\lambda = \sigma^2 = 0.57373$ as computed by
generalized (leave-one-out) cross-validation (see Appendix (C)), gives the vector of coefficients

α = (
0.295508396750165
-0.066421330540834
-0.185588761869894
0.207219565728792
-0.172647343149947
-0.091287131039280
0.153758948367567
-0.126371566473608
-0.145210858850987
-0.203239090402064
-0.071985084958214
0.289147653036707
-0.082999970338278
0.309758211838471
-0.066195126644743
0.214846717137107
-0.042010264371129
-0.173937016302716
-0.131837467187324
-0.068577206984628
)'

$$ J(\alpha) = \frac{22.788451144064311}{0.165849182106410} = 1.374046640124104\mathrm{e}{+}002 $$

b = -0.366755860887264

The value of the Rayleigh coefficient in the dual variable makes clear that the distance between groups is greater
than the distance within groups. This means that the between-class scatter is maximized and the within-class
scatter is minimized, as required, and the solution is feasible.

CONCLUSIONS 

Taking a small sample of 20 cases of HIV disease with three factors (Age, Gender, number of Lymphocyte
cells) and two levels, the value of the Rayleigh coefficient in both the primal and dual variables makes clear that the distance
between groups is greater than the distance within groups, meaning that the between-class scatter is maximized
and the within-class scatter is minimized. Since both the primal and dual solutions are feasible, there exists an optimal
(finite) solution, which means the patients are classified correctly.
APPENDICES

Appendix (A) 

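Pseudo code for (Fisher Discriminant Analysis in primal variable)

Require: A feature mapping $\phi : \mathcal{X} \to \mathcal{K} \subseteq \ell_2^n$

Require: A training sample $z = ((x_1, y_1), \ldots, (x_m, y_m))$

Determine the numbers $m_{+1}$ and $m_{-1}$ of samples of class $+1$ and $-1$

Compute the estimates $\hat{\mu}_{+1}$, $\hat{\mu}_{-1}$ and $\hat{\Sigma}$ from equation (11), then $w$ and $b$ from equation (7)

$$ J(w) = \frac{\left( w'(\hat{\mu}_{+1} - \hat{\mu}_{-1}) \right)^2}{w'\hat{\Sigma} w + w'\hat{\Sigma} w} = \frac{1}{2}\, \frac{w'(\hat{\mu}_{+1} - \hat{\mu}_{-1})(\hat{\mu}_{+1} - \hat{\mu}_{-1})' w}{w'\hat{\Sigma} w} $$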




Appendix (B) 



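Pseudo code for (Fisher Discriminant Analysis in dual variable)

Require: A training sample $z = ((x_1, y_1), \ldots, (x_m, y_m))$

Require: A kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and regularization parameter $\lambda \in \mathbb{R}^+$

Determine the numbers $m_{+1}$ and $m_{-1}$ of samples of class $+1$ and $-1$

Compute the Gram matrix $G$ and the projected means $k_{+1}$, $k_{-1}$ as defined in the Kernel Fisher Discriminant section

$$ S = \frac{1}{m} \left( GG - m_{+1} k_{+1} k_{+1}' - m_{-1} k_{-1} k_{-1}' \right) + \lambda I_m $$

$$ \alpha = S^{-1}(k_{+1} - k_{-1}) $$

$$ J(w) = \frac{1}{2}\, \frac{w'(k_{+1} - k_{-1})(k_{+1} - k_{-1})' w}{w'Sw} \qquad \text{(with } w = \alpha \text{ in the dual)} $$

$$ b = \frac{1}{2} \left( k_{-1}' S^{-1} k_{-1} - k_{+1}' S^{-1} k_{+1} \right) + \ln \frac{m_{+1}}{m_{-1}} $$

return the vector $\alpha$ of expansion coefficients and offset $b \in \mathbb{R}$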

Appendix (C)

Algorithm for cross-validation (13)

$$ S = X(X'X)^{-1}X' $$

$$ \hat{f}(x_i) = \hat{y} = Sy $$

$$ \mathrm{GCV}(\hat{f}) = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(S)/N} \right]^2 $$